library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(waffle)
library(plotly)
The datasets were downloaded from Kaggle: "Disney+ Movies and TV Shows" and
"TV shows on Netflix, Prime Video, Hulu and Disney+". You can
read about them there, including variable definitions, sources, when
they were created, and other information. Load the two datasets and use
glimpse() to explore their structures.
disneydata <- readRDS("/home/students/lyonscf/STT 2860/STT2860F22project3/data/disneypluscontent.rds")
streamingdata <- readRDS("/home/students/lyonscf/STT 2860/STT2860F22project3/data/streamingcontent.rds")
Use select() to delete the variables director, cast, country,
listed in, and description from the dataset.
disneyedits <- disneydata %>%
  select(show_id, type, title, date_added, release_year, rating, duration, duration_unit)
I used a function called pivot_longer() on the
downloaded data to change the shape of the dataset. You will need to do
additional necessary editing on the dataset before you analyze it.
Use filter() to remove any row where YesNo is 0 (a 0 means it is not on the service).
Use the separate() function to split IMDb. Separate the show rating from the max rating of 10.
Use the separate() function to split RottenTomatoes. Separate the show rating from the max rating of 100.
Use mutate() to convert the shows’ IMDb and Rotten Tomatoes ratings into numerical variables instead of categorical.
streamingedits <- streamingdata %>%
  # Keep only rows where the show is available on the service
  filter(YesNo != 0) %>%
  # Split "rating/max" strings into separate rating and max-rating columns
  separate(col = IMDb, into = c("IMDbRating", "IMDbMaxRating"), sep = "/") %>%
  separate(col = RottenTomatoes,
           into = c("RottenTomatoesRating", "RottenTomatoesMaxRating"),
           sep = "/") %>%
  # separate() produces character columns, so convert them to numeric
  mutate(IMDbRating = as.numeric(IMDbRating),
         IMDbMaxRating = as.numeric(IMDbMaxRating),
         RottenTomatoesRating = as.numeric(RottenTomatoesRating),
         RottenTomatoesMaxRating = as.numeric(RottenTomatoesMaxRating))
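As a minimal sketch of what the separate() and mutate() steps do, the example below applies them to a tiny invented table. The titles and rating strings in `toy` are hypothetical and assume the "rating/max" format described above; the real columns may contain other values.

```r
library(dplyr)
library(tidyr)

# Toy data invented for illustration only
toy <- tibble(
  Title = c("Show A", "Show B"),
  IMDb = c("8.4/10", "7.1/10"),
  RottenTomatoes = c("95/100", "62/100")
)

toy_split <- toy %>%
  separate(col = IMDb, into = c("IMDbRating", "IMDbMaxRating"), sep = "/") %>%
  separate(col = RottenTomatoes,
           into = c("RottenTomatoesRating", "RottenTomatoesMaxRating"),
           sep = "/") %>%
  # separate() returns character columns, so convert them explicitly
  mutate(across(c(IMDbRating, IMDbMaxRating,
                  RottenTomatoesRating, RottenTomatoesMaxRating),
                as.numeric))

toy_split
```

Once the columns are numeric, comparisons such as RottenTomatoesRating >= 90 behave as numeric comparisons rather than string comparisons.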
These plots use the Disney+ Dataset.
A frequency polygon (geom_freqpoly()) is an alternative
to a histogram. Rather than displaying bars, it connects the midpoints
of a histogram’s bars with line segments. Create a frequency polygon for
the year in which Disney+ content was released. Add an appropriate title
and axis labels. Use other formatting as you choose to enhance
effectiveness/appearance.
ggplot(disneyedits, aes(release_year)) +
  geom_freqpoly() +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  labs(x = "Release Year", y = "Release Count", title = "Disney Content Released") +
  theme_linedraw()
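To illustrate the idea that a frequency polygon connects the midpoints of a histogram's bars, here is a small sketch using base R's hist() on invented release years. The `years` vector and the decade-wide bins are assumptions made for the example, not values from the Disney+ dataset.

```r
# Invented release years, for illustration only
years <- c(1991, 1994, 1998, 2003, 2005, 2006, 2012, 2015, 2016, 2019)

# Bin into decades without drawing; right = FALSE makes each bin [start, end)
bins <- hist(years, breaks = seq(1990, 2020, by = 10),
             right = FALSE, plot = FALSE)

# Bar midpoints and bar heights: a frequency polygon joins these points
polygon_points <- data.frame(midpoint = bins$mids, count = bins$counts)
polygon_points
```

Plotting `count` against `midpoint` as a line gives the same shape that a frequency polygon would draw for these bins.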
Create a violin plot of release_year (x-axis) grouped by
type of program (y-axis) for content on Disney+. Fill with
a color of your choice. Add a boxplot inside the violin plot, as you did
in one of the DataCamp exercises. Re-scale the x-axis so that tick marks
appear at whole-decade intervals (e.g., 1980, 1990, 2000). Add an
appropriate title and axis labels. Use other formatting as you choose to
enhance effectiveness/appearance.
ggplot(disneyedits, aes(x = release_year, y = type)) +
  geom_violin(trim = FALSE, fill = "#702963") +
  geom_boxplot(width = 0.1) +
  theme_minimal() +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  labs(x = "Release Year", y = "", title = "Disney Content Releases")
This plot uses the Disney+ Dataset.
Create a waffle plot (which you learned in DataCamp: Visualization
Best Practices in R) to display the distribution of program
type on Disney+.
Hint: Use
round(100 * prop.table(table(DATASETNAME$VARIABLENAME))) to
create the “case_counts” data for the waffle plot. Swap out the capital
letter placeholders in the instructions for the correct dataset name and
variable name.
case_counts <- round(100 * prop.table(table(disneyedits$type)))
waffle(case_counts) +
  # case_counts holds rounded percentages, so each square is about 1% of content
  labs(title = "Streaming Content on Disney+", x = "1 square = 1% of content") +
  scale_fill_manual(values = c("#702963", "black")) +
  guides(fill = guide_legend(title = "Type"))
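The hint's expression can be unpacked step by step. The sketch below uses an invented `type` vector (not the real Disney+ data) to show how table(), prop.table(), and round() turn raw categories into the whole-number percentages that the waffle plot draws as squares.

```r
# Invented program types, for illustration only
type <- c(rep("Movie", 7), rep("TV Show", 3))

counts <- table(type)               # raw counts per category
shares <- prop.table(counts)        # proportions that sum to 1
case_counts <- round(100 * shares)  # whole-number percentages, one square each

case_counts
sum(case_counts)
```

With only two categories the rounded percentages sum to 100 here, but with more categories rounding can make the total slightly more or less than 100, so the waffle grid may not be a perfect 10 by 10 square.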
This plot uses the Disney+ Dataset.
Create one other plot of your choice from the Disney+ Dataset to explore a question of interest. You are welcome to perform additional manipulations on the data, if needed. Add an appropriate title and axis labels, as well as any other necessary formatting.
disneymovies <- disneyedits %>%
  filter(type == "Movie") %>%
  mutate(duration = as.numeric(duration))
disneychoice <- ggplot(disneymovies,
                       aes(x = release_year, y = duration,
                           text = paste("Title:", title,
                                        "<br> Release Year:", release_year,
                                        "<br> Minutes:", duration))) +
  geom_point() +
  scale_x_continuous(breaks = seq(1920, 2020, by = 10)) +
  scale_y_continuous(breaks = seq(0, 180, by = 30)) +
  labs(x = "Release Year", y = "Duration (min)", title = "Disney+ Movies") +
  theme_minimal()
ggplotly(disneychoice, tooltip = "text")
This plot uses the Streaming Dataset.
Create a barplot to display how many shows are offered on each of the four streaming services. Choose appropriate colors, labels, themes, and/or other types of formatting that you feel will enhance the meaning or visual appearance of the plot.
ggplot(streamingedits, aes(Service, fill = Service)) +
  geom_bar(width = .25) +
  labs(title = "Streaming Service Shows", y = "Count") +
  scale_fill_manual(values = c("#153866", "#66aa33", "#E50914", "#00A8E1")) +
  theme_minimal()
This plot uses the Streaming Dataset.
Create one other plot of your choice from the Streaming Dataset to explore a question of interest. You are welcome to perform additional manipulations on the data, if needed. Add an appropriate title and axis labels, as well as any other necessary formatting.
bestshows <- streamingedits %>%
  filter(RottenTomatoesRating >= 90)
ggplot(bestshows, aes(Service, fill = Service)) +
  geom_bar(width = .25) +
  scale_y_continuous(breaks = seq(0, 17, by = 1)) +
  labs(title = "90+ Rated Streaming Service Shows", y = "Count") +
  scale_fill_manual(values = c("#153866", "#66aa33", "#E50914", "#00A8E1")) +
  theme_minimal()
Question 1: Based on your plots, make five informational statements or comparisons regarding the Disney+ streaming service.
ANSWER
There was a spike in Disney content released from 2010 to 2021.
According to Rotten Tomatoes, Hulu and Netflix offer the best-rated shows.
Netflix offers the greatest variety of shows.
Avengers: Endgame is the longest movie available to stream on Disney+.
Disney+ content consists of more movies than TV shows.
Question 2: What other data would you like to have, or which existing variables would you like to see transformed, if you were going to do further explorations or visualizations? Give at least two examples.
ANSWER
I would like to have movie data alongside the streaming shows data, so that an analysis covering all content could show which service ranks best overall. I would also like to have the revenue produced by each movie and TV show. It would be interesting to compare the best movies and TV shows by revenue.
Question 3: Explain the rationale behind the choices you made with regard to plot type, formatting, and so on, when you created Visualizations 3 and 5. Walk me through your process. What motivated your decisions?
ANSWER
With Visualization 3 I wanted to create a convenient and
comprehensive plotly plot similar to the plotly plots
made with App State Baseball. Hovering over each data point shows
information that a static plot cannot convey. I wanted to view the
outliers and see whether duration correlated with release year. I was surprised
that Disney+ offers long movies such as The Sound of Music
dating back to the 1960s. However, the longest movie was released in 2019
(Avengers: Endgame).
With Visualization 5 I wanted to see which streaming service had the best-rated shows according to Rotten Tomatoes, filtering for ratings of 90 or above. I chose Rotten Tomatoes rather than IMDb because of its well-known critic ratings; it is extremely difficult to receive an “A” rating. With that said, and not to my surprise, Hulu and Netflix had the same number of “A” rated shows.
sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux
Matrix products: default
BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plotly_4.10.1 waffle_0.7.0 ggplot2_3.4.0 tidyr_1.2.1 dplyr_1.0.10
[6] readr_2.1.3
loaded via a namespace (and not attached):
[1] tidyselect_1.2.0 xfun_0.35 bslib_0.4.1 purrr_0.3.5
[5] colorspace_2.0-3 vctrs_0.5.1 generics_0.1.3 viridisLite_0.4.1
[9] htmltools_0.5.3 yaml_2.3.6 utf8_1.2.2 rlang_1.0.6
[13] jquerylib_0.1.4 pillar_1.8.1 glue_1.6.2 withr_2.5.0
[17] DBI_1.1.3 RColorBrewer_1.1-3 lifecycle_1.0.3 stringr_1.5.0
[21] munsell_0.5.0 gtable_0.3.1 htmlwidgets_1.5.4 evaluate_0.18
[25] labeling_0.4.2 knitr_1.41 tzdb_0.3.0 fastmap_1.1.0
[29] extrafont_0.18 crosstalk_1.2.0 fansi_1.0.3 highr_0.9
[33] Rttf2pt1_1.3.11 scales_1.2.1 cachem_1.0.6 jsonlite_1.8.3
[37] farver_2.1.1 gridExtra_2.3 hms_1.1.2 digest_0.6.30
[41] stringi_1.7.8 grid_3.6.0 cli_3.4.1 tools_3.6.0
[45] magrittr_2.0.3 sass_0.4.4 lazyeval_0.2.2 tibble_3.1.8
[49] extrafontdb_1.0 pkgconfig_2.0.3 ellipsis_0.3.2 data.table_1.14.6
[53] assertthat_0.2.1 rmarkdown_2.18 httr_1.4.4 rstudioapi_0.14
[57] R6_2.5.1 compiler_3.6.0